Abstract


An important problem in geometric computing is defining and computing the similarity
between two geometric shapes, e.g. point sets, curves, and surfaces. Important
geometric and topological information of many shapes can be captured by defining
a tree structure on them (e.g. medial axes and contour trees). Hence, it is natural
to study the problem of measuring similarity between trees. We study the gapped edit
distance between two ordered labeled trees, first proposed by Touzet [43].

Given two binary trees $T_1$ and $T_2$ with $m$ and $n$ nodes, respectively, we compute the general
gap edit distance in + im?n^) time. The computation of this distance in the 

case of arbitrary trees has been shown to be NP-hard [43]. We also give an algorithm for
computing the complete subtree gap edit distance, which can be applied to comparing
contour trees of terrains in $\mathbb{R}^3$.

1 Introduction


1.1 Motivation 

An important problem in geometric computing is shape comparison, which is concerned
with defining and computing the similarities between two geometric shapes, e.g.
point sets, curves, and surfaces. There are many applications of shape comparison.
For instance, understanding how similar two point sets are is critical in data mining
and machine learning ([45]). Being able to measure the similarities between two
curves helps us recognize handwriting ([24, 39]) and plan motions of robots ([33,
34]). Surface matching has applications in face recognition ([16]), image processing
([29, 14, 13]) and even mathematical biology ([9]).

The complexity of shapes grows rapidly as their dimension increases. To
compare higher dimensional objects, one technique that is often used is dimension
reduction ([22], [45]), which “compresses” the objects to lower dimensions before
performing the comparison. Many complicated shapes admit underlying tree structures
that are much simpler but preserve some key topological and geometric properties
of the original shapes. This suggests that we can compare these underlying tree
structures and use the result as a measure of similarity between the original shapes.
Here are two examples illustrating this point.


1.1.1 Medial Axis 

Given an object $S$ in Euclidean space, its associated medial axis is the set
of all points of $S$ that have more than one closest point on the boundary. This notion
was first proposed by Blum [7] as a tool for shape analysis in biology. In the plane, suppose
the boundary of $S$ is a planar curve $C$; then the medial axis consists of all centers of
disks contained in $S$ that touch $C$ tangentially at least twice (see Figure 1.1).



Figure 1.1: Medial axis of a planar object. Picture from
http://www.lems.brown.edu/vision/Presentations/Wolter/figs.html.

In particular, if $C$ is a piecewise linear polygonal curve, then the medial axis has
a tree structure with vertices the same as the vertices of the boundary polygon.

The medial axis can be viewed as a topological skeleton of an object, which is roughly
obtained by “shrinking” the boundary points inward until the object is deformed to
a treelike object. The medial axis captures some important geometric and topological
information of this object, for instance connectivity, genus, geodesics, etc.

The medial axis of an object is often used for shape compression and shape
analysis. It can also be used for shape reconstruction if both the medial axis and the
radii of the disks whose centers belong to the medial axis are known (together these are
called the medial axis transform).
In higher dimensions, we can define the medial axis similarly by replacing planar disks
with higher dimensional balls. Moreover, we can also use various other norms (e.g.
$L^1$, $L^\infty$, etc.), which will give us different medial axes. The choice of norm depends
on the particular application.

1.1.2 Persistence and Contour Tree 

Another example in which geometric objects have underlying tree structures is the
contour tree of a terrain. A terrain in $\mathbb{R}^3$ (see e.g. Figure 1.2) is the graph of some
function $f$ defined on $\mathbb{R}^2$. For instance, given a triangulation $M$ of $\mathbb{R}^2$, choose a
function $f$ defined on the vertices of $M$, and extend $f$ linearly to a function on $\mathbb{R}^2$. The
graph of $f$ is then a piecewise linear triangulated surface in $\mathbb{R}^3$, and $f$ is called the height
function of this terrain. Figure 1.2 is obtained by extending $f$ nonlinearly.



Figure 1.2: Animated terrain drawn using polynomial height functions. 

Now imagine that we use a plane $z$ = constant to slice this terrain, and we vary
the value of $z$. The intersection of the terrain with the plane is called the level set of
height $z$. The level sets could be connected or disconnected, or empty for $z$-values
that are too large or too small. In Figure 1.2 the water ponds can be viewed as the
level sets (with multiple connected components) of the terrain. Notice that as we
increase the $z$-value, the level sets change topology. For instance, some connected
components will appear, some will disappear, two components could merge into a
single component, and a single component can also split into two components. However,
all these topological changes only occur at the critical points, which are points in
the domain of the height function that correspond to local maxima, local minima
and saddle points. A component of a level set first appears when our slicing plane reaches
a local minimum, and “dies” when the plane reaches a local maximum. At a saddle point
that looks like the valley between two mountains, a single component splits
into two components, which disappear when the plane reaches the tops of
the respective mountains.

In this slicing process, much topological information about the terrain is obtained,
including its elevation data, critical point distribution, and the evolution of
the level sets. The contour tree is a graph (a tree, in fact) associated with the terrain
that captures this information as we slice the terrain from the bottom to the top.
The nodes of the contour tree are critical points of the terrain, and there is an edge
$(u, v)$ if there is a contour that appears at $v$ and disappears at $u$, where a contour is
a connected component of a level set (see Figure 3.9).

1.2 Problem Statement 

In this thesis, we study similarity between two trees. A well-studied distance
between two ordered labeled trees is the classic tree edit distance ([47, 48]). Edit
distance measures the similarity between two trees by transforming one tree to another
through pointwise edit operations that include relabeling, insertion and deletion, one node
at a time. Each operation has a prescribed nonnegative cost, and the edit
distance is defined to be the minimum cost of transforming one tree to another via
these operations.

Gapped tree edit distance was first studied by Touzet [43]. In this setting multiple
nodes, called gaps, are allowed to be inserted or deleted in a single edit operation.
Moreover, the cost for such gaps is not necessarily linear. When the gap cost function
is linear, gapped tree edit distance reduces to the classic edit distance. Touzet
proposed two models for gaps: the general model and the complete subtree model.
Touzet [43] proved that the general gap edit distance computation is NP-hard.

The complete subtree model is rather restrictive, so it is desirable to be able to
compute the general gap edit distance in certain cases. Thus, the central problem
we consider is

Problem Statement: Is it possible to compute the general gap edit distance in
polynomial time for a special class of trees, for instance, binary trees?

We answer this question in the affirmative in Chapter 3. In particular, we prove 
that: 

Theorem 1.1 (Main Theorem). Given two ordered labeled binary trees $T_1$ and $T_2$
with vertex sets $V_1 := V(T_1)$ and $V_2 := V(T_2)$, respectively, and an affine gap cost
function, let $m := |V_1|$ and $n := |V_2|$. The general gap edit distance between $T_1$ and
T 2 can be computed in 0{m^n^ + m^n^) time. If m ^ n, then the running time is 
0{wf). 

Touzet [43] gave an algorithm for computing the complete subtree gap tree edit 
distance. In Chapter 3, we give a different algorithm and prove that: 

Theorem 1.2 (Complete Subtree Gap Tree Edit Distance). Given two ordered labeled
binary trees $T_1$ and $T_2$ with vertex sets $V_1 := V(T_1)$ and $V_2 := V(T_2)$, respectively,
and an affine gap cost function, let $m := |V_1|$ and $n := |V_2|$. The complete subtree
gap edit distance between Ti and T 2 can be computed in 0{rn^n^) time. If m ^ n, 
then the running time is 0{rn^). 
1.3 Outline


In Chapter 2, we study string comparison using edit distance as a motivation for 
tree edit distance and its generalizations. We give an overview of the classic tree edit 
distance, and present Zhang and Shasha’s algorithm [47] in detail. 

In Chapter 3, we study gapped tree edit distance, and two gap models proposed 
by Touzet [43]. We prove Theorems 1.1 and 1.2 using dynamic programming, which
is motivated by sequence alignment algorithms and Zhang and Shasha’s algorithm 
[47]. We also discuss an application of the complete subtree gap model to terrain 
comparison via comparing similarity between their corresponding contour trees. 

Finally in Chapter 4, we summarize our results, and propose some open problems 
suitable for future projects. 

All trees considered in this thesis are ordered and labeled.

2 Classic Tree Edit Distance and Related Problems: An Overview


In this chapter, we study the classic tree edit distance between two ordered labeled
trees. In Section 2.1, we discuss the notion of edit distance and how it can be used
to compare two sequences of characters. This motivates the classic tree edit distance
as well as its generalizations discussed in Chapter 3. In Section 2.2, we define tree
edit distance and other terminology that will be used in this and later chapters.
In Sections 2.3 and 2.4, we review Zhang and Shasha’s algorithm [47] for computing
tree edit distance in detail.

2.1 Edit Distance and Sequence Alignment 

Edit distance was first used to measure the similarity between two strings, which
are sequences of characters. It has several different definitions, and one of the most
commonly used variants is the Levenshtein distance. In this version, one string is
transformed to another via a sequence of edit operations that include insertion,
deletion and substitution. Each edit operation has a positive cost, and the distance
between two strings is defined to be the minimal cost of transforming one string to
another. A graphical representation of this transformation is given by an alignment
between these two strings, which is a way of placing one string on top of another
so that a one-to-one correspondence among the characters is created, with deleted
characters aligned with a special character denoted as a blank.

Example 2.1.1. Let $S_1$ = “save” and $S_2$ = “salvage”; then a possible alignment is:

    s a - v - - e
    s a l v a g e                (2.1.1)

The cost of an alignment is given by the cost of the corresponding transformation. 
Thus, computing the edit distance is equivalent to finding the optimal alignment 
between two strings. 

For the cost of substituting one character with another, one can choose a metric $p$
(symmetric, positive definite and satisfying the triangle inequality) such that $p(i,j)$
is the cost of changing character $i$ to $j$. In particular, $p(i,j) = 0$ if and only if
$i = j$. Penalizing the deleted characters is equivalent to penalizing the blank
characters in an alignment:

Definition 2.1.1. A gap of a sequence in an alignment is a maximal run of consecutive
blank characters.

In Example 2.1.1 above, we have two gaps: one of size one and another of size
two. For some applications (e.g. computational biology), it is more likely to have a
single gap of size $k > 0$ than $k$ isolated gaps, each of size 1 (see [36]). Thus for the
cost of gaps, it is desirable to have a function $w$ such that

$$w(k) \le k\,w(1),$$

or in general

$$w(k_1 + k_2) \le w(k_1) + w(k_2), \qquad k_1, k_2 \in \mathbb{Z}^{+}.$$

Such $w$ is called a convex function. In particular:

Lemma 2.1.1. An affine function

$$w(k) = \begin{cases} 0 & \text{for } k = 0 \\ a + bk & \text{for } k \in \mathbb{Z}^{+} \end{cases} \tag{2.1.2}$$

is convex if $a \ge 0$, $b > 0$.

Proof. For any $k_1, k_2 \in \mathbb{Z}^{+}$, we have

$$w(k_1 + k_2) = a + b(k_1 + k_2) \le 2a + b(k_1 + k_2) = w(k_1) + w(k_2),$$

since $a \ge 0$. □

Using the affine gap cost function above, if a blank character starts a gap,
we penalize it with $(a + b)$, and if it continues a gap, we only penalize it with $b$.
Now given an alignment $M$ of strings $S_1$ and $S_2$, its cost is given by

$$\gamma(M) := \sum_{\substack{i \text{ is matched with } j, \\ \text{both nonblank}}} p(i,j) + \sum_{g \in G} (a + b|g|), \tag{2.1.3}$$

where $G$ is the set of all gaps in $M$, and $|g|$ is the size of the gap $g$.

Example 2.1.2. Suppose $p(i,j) = 0$ if $i = j$, and $p(i,j) = 1$ otherwise. Choose
$w(k) = a + bk$. Then the alignment in Example 2.1.1 has one gap of size one and one
gap of size two, so its cost is $(a + b) + (a + 2b) = 2a + 3b$.
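As an illustration of formula (2.1.3), the following minimal Python sketch evaluates the cost of an alignment that is given explicitly as two rows of equal length. The function name `alignment_cost`, the use of `-` as the blank character, and the default 0/1 substitution cost are assumptions made for this example only; they are not taken from [36].

```python
def alignment_cost(top, bottom, a, b, p=None, blank="-"):
    """Evaluate (2.1.3) for an alignment given as two equal-length rows.

    a, b : parameters of the affine gap cost w(k) = a + b*k
    p    : substitution cost; defaults to 0 for equal characters, 1 otherwise
    """
    if p is None:
        p = lambda x, y: 0 if x == y else 1
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    cost = 0
    prev_gap_row = None  # row that held a blank in the previous column, if any
    for x, y in zip(top, bottom):
        if x == blank or y == blank:
            row = "top" if x == blank else "bottom"
            # A blank that opens a gap costs a + b; one that extends it costs b.
            cost += b if row == prev_gap_row else a + b
            prev_gap_row = row
        else:
            cost += p(x, y)
            prev_gap_row = None
    return cost


# The alignment of Example 2.1.1, evaluated with a = 2 and b = 1.
example_cost = alignment_cost("sa-v--e", "salvage", a=2, b=1)
```

Summing `a + b` for the first blank of a gap and `b` for each further blank gives `a + b|g|` per gap, which is why the loop only needs to remember whether the previous column contained a blank in the same row.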


Now we focus on the problem of computing the edit distance between $S_1$ and $S_2$.
Our presentation is based on [36].

Let $m := |S_1|$ and $n := |S_2|$ be the numbers of characters in $S_1$ and $S_2$, respectively. Let
$S_1[i]$ be the prefix of $S_1$ consisting of the first $i$ characters, and define $S_2[j]$ similarly for
$S_2$, where $1 \le i \le m$ and $1 \le j \le n$. To compute an optimal alignment with the affine gap
cost function (2.1.2), we define three auxiliary functions:

$Q_{**}[i,j]$ := min cost of aligning $S_1[i]$ with $S_2[j]$ that ends with matching $i$ with $j$,

$Q_{\perp *}[i,j]$ := min cost of aligning $S_1[i]$ with $S_2[j]$ that ends with matching a blank
node with $j$,

$Q_{*\perp}[i,j]$ := min cost of aligning $S_1[i]$ with $S_2[j]$ that ends with matching $i$ with
a blank node.

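Theorem 2.1. For $1 \le i \le m$ and $1 \le j \le n$, and gap cost function $w(k) = a + bk$,
the matrices $Q_{**}$, $Q_{\perp *}$ and $Q_{*\perp}$ defined above satisfy the following recurrence relations
(with initializations given in the proof below):

$$Q_{**}[i,j] = p(i,j) + \min \begin{cases} Q_{**}[i-1,j-1] \\ Q_{\perp *}[i-1,j-1] \\ Q_{*\perp}[i-1,j-1] \end{cases} \tag{2.1.4}$$

$$Q_{\perp *}[i,j] = \min \begin{cases} Q_{**}[i,j-1] + (a+b) & \text{starting a new gap} \\ Q_{\perp *}[i,j-1] + b & \text{continuing a preexisting gap} \\ Q_{*\perp}[i,j-1] + (a+b) & \text{starting a new gap} \end{cases} \tag{2.1.5}$$

$$Q_{*\perp}[i,j] = \min \begin{cases} Q_{**}[i-1,j] + (a+b) & \text{starting a new gap} \\ Q_{\perp *}[i-1,j] + (a+b) & \text{starting a new gap} \\ Q_{*\perp}[i-1,j] + b & \text{continuing a preexisting gap} \end{cases} \tag{2.1.6}$$

The minimum cost $Q[m,n]$ of aligning $S_1$ with $S_2$ is given by

$$Q[m,n] = \min\{Q_{**}[m,n],\ Q_{\perp *}[m,n],\ Q_{*\perp}[m,n]\}.$$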

Proof. We first verify the above claim for $i, j > 1$. For the recursion of $Q_{**}[i,j]$, the
alignment ends with $i$ aligned with $j$; therefore no matter how $S_1[i-1]$ was aligned
with $S_2[j-1]$, we simply add the penalty $p(i,j)$ to the previous cost, and there is
no gap issue to worry about.

For the recursion of $Q_{\perp *}[i,j]$, the alignment ends with a blank node aligned with
the node $j$. If $S_1[i]$ was matched with $S_2[j-1]$ ending in $i$ aligned with $j-1$, then this blank
node is the beginning of a new gap, hence we penalize it with $a + b$. If a blank node
was aligned with $j-1$ in the previous step, then we are continuing a preexisting gap,
hence we only penalize it by $b$. The only case left is when $i$ was aligned to a blank
node in the previous step; this is again the beginning of a new gap, hence the
penalty $a + b$.

The argument for $Q_{*\perp}[i,j]$ is completely symmetric.

It only remains to show that the above holds for $i, j = 1$. This requires establishing
appropriate initial values for $Q_{**}$, $Q_{\perp *}$ and $Q_{*\perp}$:

$$\begin{cases} Q_{**}[0,0] = 0 \\ Q_{**}[i,0] = +\infty & \text{for } 1 \le i \le m \\ Q_{**}[0,j] = +\infty & \text{for } 1 \le j \le n \end{cases} \tag{2.1.7}$$

$$\begin{cases} Q_{\perp *}[i,0] = +\infty & \text{for } 0 \le i \le m \\ Q_{\perp *}[0,j] = a + bj & \text{for } 1 \le j \le n \end{cases} \tag{2.1.8}$$

$$\begin{cases} Q_{*\perp}[i,0] = a + bi & \text{for } 1 \le i \le m \\ Q_{*\perp}[0,j] = +\infty & \text{for } 0 \le j \le n \end{cases} \tag{2.1.9}$$

Here the index 0 stands for a void sequence. We set $Q_{**}[i,0]$ and $Q_{**}[0,j]$ to infinity
since we cannot match a nontrivial node to a node in a void sequence. Similarly
$Q_{\perp *}[i,0] = +\infty$ because there is no node in $S_2$ for a blank node in $S_1$ to
be aligned to. However, $Q_{\perp *}[0,j] = a + bj$ since there is a unique way to match
a void sequence with a sequence with $j$ characters: use a gap with $j$ blank nodes.
The initial values of $Q_{*\perp}$ are assigned in a similar manner. Since we are taking the
minimum, the infinite values do not affect our computation. □

Using the above recurrence relations, a straightforward dynamic programming
algorithm computes the edit distance in $O(mn)$ time with $O(mn)$ space (which can
be further improved to linear space, see [36]). In the case that the gap cost function is
arbitrary, similar recurrences can be obtained that compute the distance in
$O(m^2 n + mn^2)$ time.
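The recurrences (2.1.4) to (2.1.6) and the initial values (2.1.7) to (2.1.9) translate almost line by line into code. The sketch below is one possible Python rendering, not the presentation of [36]; the function name `affine_gap_distance`, the table names `mm`, `bs`, `sb` (standing for $Q_{**}$, $Q_{\perp *}$, $Q_{*\perp}$) and the default 0/1 substitution cost are assumptions made for illustration.

```python
import math


def affine_gap_distance(s1, s2, a=1, b=1, p=None):
    """Minimum alignment cost of s1 and s2 under the gap cost w(k) = a + b*k."""
    if p is None:
        p = lambda x, y: 0 if x == y else 1
    m, n = len(s1), len(s2)
    inf = math.inf
    # mm[i][j]: last column matches s1[i] with s2[j]        (Q_{**})
    # bs[i][j]: last column puts a blank over s2[j]         (Q_{\perp *})
    # sb[i][j]: last column puts s1[i] over a blank         (Q_{*\perp})
    mm = [[inf] * (n + 1) for _ in range(m + 1)]
    bs = [[inf] * (n + 1) for _ in range(m + 1)]
    sb = [[inf] * (n + 1) for _ in range(m + 1)]
    mm[0][0] = 0                      # (2.1.7)
    for j in range(1, n + 1):
        bs[0][j] = a + b * j          # (2.1.8): one gap of j blanks
    for i in range(1, m + 1):
        sb[i][0] = a + b * i          # (2.1.9): one gap of i blanks
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mm[i][j] = p(s1[i - 1], s2[j - 1]) + min(
                mm[i - 1][j - 1], bs[i - 1][j - 1], sb[i - 1][j - 1])
            bs[i][j] = min(mm[i][j - 1] + a + b,   # start a new gap
                           bs[i][j - 1] + b,       # continue a gap
                           sb[i][j - 1] + a + b)   # start a new gap
            sb[i][j] = min(mm[i - 1][j] + a + b,   # start a new gap
                           bs[i - 1][j] + a + b,   # start a new gap
                           sb[i - 1][j] + b)       # continue a gap
    return min(mm[m][n], bs[m][n], sb[m][n])


save_vs_salvage = affine_gap_distance("save", "salvage", a=1, b=1)
```

Each of the three tables has $(m+1)(n+1)$ entries and every entry is filled in constant time, which matches the $O(mn)$ time and space bound stated above. This version keeps the full tables for readability; the linear-space refinement mentioned in the text is not attempted here.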

A string can be thought of as a tree with a single leaf. In fact, many tree
algorithms specialize to strings, and are as efficient as the best string algorithms [48].
Now the question is: can we generalize the edit distance from strings to trees?
The idea is again to transform one tree to another via a sequence of edit operations,
and the distance is defined to be the minimal cost of such transformations (see
Section 2.2 below for the precise definition). In 1977, Selkow [35] first attempted to generalize
string edit distance to ordered trees. Later in 1979, Tai [40] gave the first definition of
edit distance between two ordered labeled trees, and the first polynomial algorithm to
compute it. Many variants have been extensively studied, e.g. edit distance between
unordered trees [49], the tree alignment problem [25], the tree inclusion problem [26], etc.

Let $T_1$ and $T_2$ be two ordered labeled trees with $m$ and $n$ nodes, respectively. A
straightforward dynamic programming algorithm computes the edit distance in time
$O(m^2 n^2)$. In 1989, Zhang and Shasha [47] computed the edit distance in
$O(mn \cdot \min\{D_1, L_1\} \cdot \min\{D_2, L_2\})$ time with space $O(mn)$, where $D_i$ is the depth of $T_i$, and
$L_i$ is the number of leaves of $T_i$, $i = 1, 2$. However, the worst case running time is
still $O(m^2 n^2)$. In 1998, Klein [27] modified Zhang and Shasha’s algorithm using path
decompositions, and improved the running time to $O(m^2 n \log n)$. In 2001, Chen [12]
gave an algorithm that computes the distance in $O(mn + L_1^2 n + L_1^{2.5} L_2)$ time. In
2003, Dulucq and Touzet [19] computed the distance in $O(mn \log^2 n)$ time. In 2009,
Demaine et al. computed the edit distance in $O(m^2 n)$ time.

For the remainder of this chapter, we study Zhang and Shasha’s algorithm in
detail.

2.2 Zhang and Shasha’s Algorithm Part I: Setup

In this section, we define the edit distance between two ordered labeled rooted trees.
A tree is said to be ordered if for each node, we can put a left-to-right order on its
siblings. Every tree embedded in the plane has a natural order after we fix an arbitrary
vector and simply determine the order of the nodes with a sweep along the
direction of that vector. A tree is called labeled if each node has an assigned symbol
taken from a finite alphabet $\Sigma$.

The edit operations are 

(1) Rename. To change the label of a node to another label.

(2) Delete. To delete a node $u$; all children of $u$ become children of the parent
of $u$, while maintaining the order.

(3) Insert. To insert a node $u$ as a child of $u'$. A consecutive sequence of children
of $u'$ now becomes the children of $u$.
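The three edit operations can be made concrete on a small pointer-based tree. The following Python sketch is only an illustration; the class `Node` and the helper names `rename`, `delete`, `insert` and `to_tuple` are hypothetical, and deleting the root (which has no parent) is deliberately left out.

```python
class Node:
    """A node of an ordered labeled tree; children are kept left to right."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])


def rename(node, new_label):
    """(1) Rename: change the label of a node."""
    node.label = new_label


def delete(parent, k):
    """(2) Delete the k-th child u of parent; u's children take its place in order."""
    u = parent.children[k]
    parent.children[k:k + 1] = u.children


def insert(parent, label, start, end):
    """(3) Insert a new child of parent that adopts parent.children[start:end]."""
    u = Node(label, parent.children[start:end])
    parent.children[start:end] = [u]
    return u


def to_tuple(node):
    """Nested (label, [children]) form, convenient for comparing trees."""
    return (node.label, [to_tuple(c) for c in node.children])


# Turn f(d(a, c(b)), e) into f(c(d(a, b)), e) with one deletion and one insertion.
t = Node("f", [Node("d", [Node("a"), Node("c", [Node("b")])]), Node("e")])
delete(t.children[0], 1)   # delete c: b becomes a child of d
insert(t, "c", 0, 1)       # insert c under f, adopting d
result = to_tuple(t)
```

Slice assignment keeps the sibling order intact in both `delete` and `insert`, which is exactly the order-preservation requirement in the definitions of the operations.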

Represent an edit operation as a pair of nodes $(a, b)$, or as $a \to b$, to indicate that
we relabel the node with label $a$ by $b$. Introduce a special label $\Lambda$ that is not in $\Sigma$, so that
$(a, \Lambda)$ or $a \to \Lambda$ indicates the deletion of $a$, and similarly $(\Lambda, b)$ or $\Lambda \to b$ indicates the
insertion of $b$. We sometimes identify a tree node with its label whenever there is
no ambiguity. Consider two trees $T_1$ and $T_2$. Let $V(T_1)$ and $V(T_2)$ be the respective
vertex sets. Given a distance function (positive definite, symmetric, and satisfying the
triangle inequality) $\gamma : (V(T_1) \cup \{\Lambda\}) \times (V(T_2) \cup \{\Lambda\}) \to \mathbb{R}_{\ge 0}$, we call such $\gamma$ a cost
function on edit operations. Now let $S$ be a sequence $s_1, s_2, \ldots$ of edit operations.
Each of them has a cost, and thus the cost of $S$ is the sum of all the costs:

$$\gamma(S) := \sum_{i=1}^{|S|} \gamma(s_i), \tag{2.2.1}$$

where $|S|$ denotes the number of edit operations in $S$. The tree edit distance is
defined as:

Definition 2.2.1. $\delta(T_1, T_2) := \min\{\gamma(S) \mid S \text{ is an edit operation sequence taking } T_1 \text{ to } T_2\}$.

Since $\delta$ is a finite sum of values of a positive definite distance function, it is itself a positive
definite distance function as well.

The tree edit operations give rise to a mapping that is an (equivalent) graphical
representation of what edit operations apply to each node in the two trees. Let
$T_1$ and $T_2$ be two labeled ordered trees with $N_1$ and $N_2$ nodes, respectively. Fix a
traversal rule (e.g. postorder or preorder), and define $T[i]$ as the $i$-th node of $T$ in
the traversal. With this traversal fixed, we can identify the node $T[i]$ with the
number $i$.

Definition 2.2.2 (Mapping Between Two Trees). A mapping between $T_1$ and $T_2$ is
a triple $(M, T_1, T_2)$, where $M$ is any set of pairs of integers $(i,j)$ with $1 \le i \le N_1$,
$1 \le j \le N_2$, such that for any pairs $(i_1, j_1)$ and $(i_2, j_2)$ in $M$:

(1) $i_1 = i_2$ if and only if $j_1 = j_2$. This is called the one-to-one condition.

(2) $T_1[i_1]$ is to the left of $T_1[i_2]$ if and only if $T_2[j_1]$ is to the left of $T_2[j_2]$. This is
called the sibling order condition.

(3) $T_1[i_1]$ is an ancestor of $T_1[i_2]$ if and only if $T_2[j_1]$ is an ancestor of $T_2[j_2]$. This
is called the ancestor order condition.

$M$ can be viewed as an order (sibling order and ancestor order) preserving mapping
taking (a subset of) vertices from one tree to those of another tree. We say that
a node is not touched if it does not appear in any pair of $M$.
If we draw a line connecting the two nodes in each pair of
$M$, then nodes that are not touched do not have lines coming in or going out. Let
$I$ and $J$ be the sets of nodes in $T_1$ and $T_2$, respectively, that
are not touched.
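The three conditions of Definition 2.2.2 can be checked mechanically once the nodes are numbered. The sketch below assumes a left-to-right postorder numbering and, for each node $i$, the number `l[i]` of the leftmost leaf in its subtree; under that assumption $a$ is a proper ancestor of $b$ exactly when `l[a] <= b < a`, and $a$ lies to the left of $b$ exactly when `a < l[b]`. The function names and the two example trees are hypothetical.

```python
def is_ancestor(l, a, b):
    """True if node a is a proper ancestor of node b (postorder numbers)."""
    return l[a] <= b < a


def is_left_of(l, a, b):
    """True if node a lies to the left of node b: tree(a) ends before tree(b) starts."""
    return a < l[b]


def is_mapping(M, l1, l2):
    """Check the one-to-one, sibling order and ancestor order conditions."""
    for (i1, j1) in M:
        for (i2, j2) in M:
            if (i1 == i2) != (j1 == j2):
                return False
            if is_left_of(l1, i1, i2) != is_left_of(l2, j1, j2):
                return False
            if is_ancestor(l1, i1, i2) != is_ancestor(l2, j1, j2):
                return False
    return True


# Example data (index 0 is unused so that nodes are numbered from 1).
# T1 = f(d(a, c(b)), e): postorder a=1, b=2, c=3, d=4, e=5, f=6
# T2 = f(c(d(a, b)), e): postorder a=1, b=2, d=3, c=4, e=5, f=6
l1 = [None, 1, 2, 2, 1, 5, 1]
l2 = [None, 1, 2, 1, 1, 5, 1]

good = {(1, 1), (2, 2), (4, 3), (5, 5), (6, 6)}   # a, b, d, e, f matched by label
bad_sibling = {(1, 2), (2, 1)}                    # swaps a and b
bad_ancestor = {(3, 4), (4, 3)}                   # maps c to c and d to d
```

In this example `good` leaves the node `c` untouched in both trees. Mapping `c` to `c` together with `d` to `d` is rejected because `d` is an ancestor of `c` in the first tree but a descendant of `c` in the second.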

Definition 2.2.3. The cost of such a mapping, with $\gamma$ given as above, is
defined as:

$$\gamma(M) := \sum_{(i,j) \in M} \gamma(T_1[i] \to T_2[j]) + \sum_{i \in I} \gamma(T_1[i] \to \Lambda) + \sum_{j \in J} \gamma(\Lambda \to T_2[j]). \tag{2.2.2}$$
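Formula (2.2.2) can be evaluated directly from a mapping and the node labels. In the minimal sketch below, labels are stored in 1-based lists, the blank label $\Lambda$ is represented by Python's `None`, and a unit cost is used by default; these representation choices and the names `mapping_cost` and `unit_cost` are assumptions for illustration.

```python
LAMBDA = None  # stands for the special blank label


def unit_cost(x, y):
    """Cost 0 for identical labels, 1 for any relabeling, deletion or insertion."""
    return 0 if x == y else 1


def mapping_cost(M, labels1, labels2, gamma=unit_cost):
    """Evaluate (2.2.2); labels1 and labels2 are 1-based (index 0 unused)."""
    touched1 = {i for i, _ in M}
    touched2 = {j for _, j in M}
    cost = sum(gamma(labels1[i], labels2[j]) for i, j in M)
    # Untouched nodes of T1 (the set I) are deleted.
    cost += sum(gamma(labels1[i], LAMBDA)
                for i in range(1, len(labels1)) if i not in touched1)
    # Untouched nodes of T2 (the set J) are inserted.
    cost += sum(gamma(LAMBDA, labels2[j])
                for j in range(1, len(labels2)) if j not in touched2)
    return cost


# Postorder labels of T1 = f(d(a, c(b)), e) and T2 = f(c(d(a, b)), e).
labels1 = [None, "a", "b", "c", "d", "e", "f"]
labels2 = [None, "a", "b", "d", "c", "e", "f"]
M = {(1, 1), (2, 2), (4, 3), (5, 5), (6, 6)}
cost_of_M = mapping_cost(M, labels1, labels2)
```

Here every matched pair carries the same label, so the whole cost comes from the two untouched nodes labeled `c`, one deleted from the first tree and one inserted into the second.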

Then it is easy to show that

Lemma 2.2.1. Given an edit operation sequence $S$ from $T_1$ to $T_2$, there exists a mapping
$M$ from $T_1$ to $T_2$ such that $\gamma(M) \le \gamma(S)$. Conversely, for any mapping $M$,
there exists a sequence $S$ of editing operations such that $\gamma(S) = \gamma(M)$. Therefore:

$$\delta(T_1, T_2) = \min\{\gamma(M) \mid M \text{ is a mapping from } T_1 \text{ to } T_2\}. \tag{2.2.3}$$

The above lemma implies that in order to compute the edit distance, it suffices
to understand all mappings that are order preserving. Therefore in the following, we
will switch our perspective from transforming one tree to another, to mapping the
nodes of one tree to the nodes of another. To do that, we need some terminology:

1. From now on we fix the left-to-right postorder traversal rule unless otherwise
specified. This rule defines a numbering among all the nodes of a tree $T$.

2. Let $T[i]$ denote the $i$-th node (so that we can identify $T[i]$ with $i$), and let $l(i)$ be
the number of the leftmost leaf descendant of the subtree rooted at $T[i]$. Hence
when $T[i]$ is a leaf, $l(i) = i$.

3. The parent of $T[i]$ is denoted $p(i)$, $anc(i)$ denotes the ancestors of $T[i]$,
and $desc(i)$ the descendants of $T[i]$.

4. Let $forest(i..j) := T[i..j]$ be the ordered subforest of $T$ induced by the nodes
numbered from $i$ to $j$ inclusive. If $i > j$, then $T[i..j] = \emptyset$.

5. In particular, $T[1..i]$ will be referred to as $forest(i)$ when the tree $T$ is clear
from context. Note that $T[l(i)..i]$ is simply the subtree rooted at $T[i]$, and thus
will be referred to as $tree(i)$.

6. We use $Size(i)$ to denote the number of nodes in $tree(i)$.
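The numbering and the function $l$ introduced above can be computed in a single postorder traversal. The sketch below assumes a tree is given as a nested pair `(label, [children])`; this input format and the names `postorder` and `size` are hypothetical choices for the example.

```python
def postorder(tree):
    """Return (labels, l) for a tree given as (label, [children]).

    Both lists are 1-based (index 0 unused): labels[i] is the label of T[i]
    in left-to-right postorder, and l[i] is the number of the leftmost leaf
    descendant of the subtree rooted at T[i].
    """
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leftmost = visit(child)
            if first is None:
                first = leftmost          # leftmost leaf comes from the first child
        labels.append(label)
        idx = len(labels) - 1             # postorder number of this node
        l.append(idx if first is None else first)
        return l[idx]

    visit(tree)
    return labels, l


def size(i, l):
    """Size(i): tree(i) consists of the nodes l(i), ..., i."""
    return i - l[i] + 1


t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, l = postorder(t1)
```

Because $tree(i) = T[l(i)..i]$ is a contiguous range of postorder numbers, $Size(i)$ needs no extra traversal and is simply $i - l(i) + 1$.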

2.3 Zhang and Shasha’s Algorithm Part II: Recurrences 

Zhang and Shasha’s idea for computing $\delta(T_1, T_2)$ is the following:

We use dynamic programming starting from the distances between smaller components
of $T_1$ and $T_2$, and build up from that. A portion of a tree is in general a forest,
thus it is important to understand the distances between such forests. We build up
the tree from the rightmost node in the postorder traversal in a bottom-up fashion.

Definition 2.3.1. We define

$$\mathit{forestdist}(i'..i,\, j'..j) := \mathit{forestdist}(T_1[i'..i],\, T_2[j'..j]) := \delta(T_1[i'..i],\, T_2[j'..j]),$$

where $\delta$ is defined as before, and

$$\mathit{forestdist}(i, j) := \mathit{forestdist}(1..i,\, 1..j).$$

Finally, the distance between the subtrees rooted at $i$ and $j$, respectively, is denoted as

$$\mathit{treedist}(i, j) = \mathit{forestdist}(l(i)..i,\, l(j)..j).$$

The dynamic programming algorithm design is based on the following key recursions:

Theorem 2.2 ([47]). For any $i \in desc(i_1)$ and $j \in desc(j_1)$*,

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases} \mathit{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda) \\ \mathit{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]) \\ \mathit{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathit{forestdist}(l(i)..i-1,\, l(j)..j-1) + \gamma(T_1[i] \to T_2[j]) \end{cases}$$

* Here we have identified $i_1$ with the node $T_1[i_1]$, given the postorder numbering. Same for $j_1$.

Proof. First note that since $i \in desc(i_1)$, $l(i_1) \le i \le i_1$. Similarly, $l(j_1) \le j \le j_1$.
To prove this claim, it suffices to find a minimum-cost map between $forest(l(i_1)..i)$
and $forest(l(j_1)..j)$. Notice that $i$ and $j$ are the rightmost nodes of the two forests,
respectively, and there are three possible configurations of $i$ and $j$ in any mapping
$M$:

1. $i$ is not touched by a line in $M$. Then $i$ is deleted, i.e. $i \to \Lambda$. Thus

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \mathit{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda).$$

2. $j$ is not touched by a line in $M$. Then, similarly to the above,

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \mathit{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]).$$

3. Both $i$ and $j$ are touched by a line (see Figure 2.1). This is the only non-trivial
case, and we claim that in this case, $(i,j) \in M$, i.e., $i$ must be mapped to $j$.
To prove the claim, we suppose the contrary: suppose $(i,k)$ and $(h,j)$ are in
$M$, and $h \ne i$, $k \ne j$. Thus either $l(i_1) \le h \le l(i) - 1$, or $l(i) \le h \le i - 1$.
The first case implies that $i$ is to the right of $h$, so $k$ must be to the right of
$j$ by the sibling condition on $M$. But there is no such $k$ in $forest(l(j_1)..j)$.
This is a contradiction, and it forces $l(i) \le h \le i - 1$, i.e., $i$ is a proper ancestor of $h$.
By the ancestor condition, $k$ is a proper ancestor of $j$, which is again impossible
in $forest(l(j_1)..j)$. Therefore $h = i$. By a symmetric argument, $k = j$ as well.
Therefore $i$ is mapped to $j$, and by the ancestor condition on $M$, the subtree
rooted at $i$ must be mapped into the subtree rooted at $j$. Therefore the last
case follows:

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \mathit{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathit{forestdist}(l(i)..i-1,\, l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).$$

Figure 2.1: Both $i$ and $j$ are touched by a line. In this case, $i$ must be mapped to $j$.

□


Theorem 2.2 has the following corollary:

Corollary 2.2.1 ([47]). For any $i \in desc(i_1)$, $j \in desc(j_1)$:

1. If $l(i) = l(i_1)$ and $l(j) = l(j_1)$, i.e., $i$ is on the path from $i_1$ to its leftmost leaf
$l(i_1)$, and $j$ is on the path from $j_1$ to its leftmost leaf $l(j_1)$, then

$$\mathit{treedist}(i,j) = \mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases} \mathit{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda) \\ \mathit{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]) \\ \mathit{forestdist}(l(i)..i-1,\, l(j)..j-1) + \gamma(T_1[i] \to T_2[j]) \end{cases}$$

2. If $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$, then

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases} \mathit{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda) \\ \mathit{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]) \\ \mathit{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathit{treedist}(i,j) \end{cases}$$

Proof. The first part is easy: if $l(i) = l(i_1)$ and $l(j) = l(j_1)$, then $T_1[l(i_1)..i] =
T_1[l(i)..i] = tree(i)$. Similarly $T_2[l(j_1)..j] = T_2[l(j)..j] = tree(j)$, thus the first equality
in part (1) follows. The rest of part (1) follows from the fact that
$forestdist(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) = forestdist(\emptyset, \emptyset) = 0$.

For part (2), note that

$$\mathit{forestdist}(l(i_1)..i,\, l(j_1)..j) \le \mathit{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathit{treedist}(i,j),$$

since the latter formula represents a particular (and therefore possibly suboptimal)
mapping of $forest(l(i_1)..i)$ to $forest(l(j_1)..j)$. For the same reason,

$$\mathit{treedist}(i,j) \le \mathit{forestdist}(l(i)..i-1,\, l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).$$

Therefore $forestdist(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + treedist(i,j)$ is a tighter upper
bound for $forestdist(l(i_1)..i,\, l(j_1)..j)$ than the third term in Theorem 2.2. Since we are
looking for the minimum value of $forestdist(l(i_1)..i,\, l(j_1)..j)$, we can use a tighter upper
bound without affecting the result. □

The above theorem and corollary serve as the basis for using dynamic programming
to compute tree edit distance. More precisely, Theorem 2.2 implies that in order
to compute $treedist(i_1, j_1) = forestdist(l(i_1)..i_1,\, l(j_1)..j_1)$†, we need in advance almost
all values of $treedist(i,j)$ for $i \in desc(i_1)$ and $j \in desc(j_1)$, as long as $l(i) \ne l(i_1)$ or
$l(j) \ne l(j_1)$. This suggests a bottom-up approach for computing $treedist(i_1, j_1)$:
compute $treedist(i,j)$ for $i = l(i_1), \ldots, i_1$ and $j = l(j_1), \ldots, j_1$. The number of
all such pairs is on the order of $N_1 N_2$, where $N_1$ is the number of nodes in $tree(i_1)$,
and $N_2$ is the number of nodes in $tree(j_1)$.

† Pick $i = i_1$ and $j = j_1$ in the above.

However, Corollary 2.2.1 suggests that we do not have to compute all such intermediate
distances of subtree pairs. Given two subtrees $tree(i)$ and $tree(j)$, to actually
compute the distance $treedist(i,j)$, we need the distances between all the prefixes of
the two subtrees, where a prefix of a tree is the result of deleting the rightmost node
in the postorder numbering. We can keep deleting the rightmost nodes to get all the
prefixes.

Now if $i$ is on the path from $l(i_1)$ to $i_1$, and $j$ is on the path from $l(j_1)$ to $j_1$, then
in computing $treedist(i_1, j_1)$, we get $treedist(i,j)$ as a byproduct, since $tree(i)$ and
$tree(j)$ are prefixes of $tree(i_1)$ and $tree(j_1)$, respectively. Thus the upshot is this: in
computing the distances of subtree pairs, we can skip those pairs in which each subtree
is the prefix of some supertree whose root is an ancestor of the root of this subtree.
It is easy to see that those are exactly subtrees rooted at nodes that do not have a
left sibling. This motivates the following definition:

Definition 2.3.2 ([47]). Given a tree $T$, we define the set of LR key roots of $T$ to
be the union of the root of $T$, together with all nodes that have left siblings. Here LR
refers to the left-to-right postorder numbering.
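The LR key roots can be read off from the leftmost-leaf function $l$ alone: a node $i$ is a key root exactly when no node with a larger postorder number has the same leftmost leaf, which for a non-root node is equivalent to having a left sibling. The short Python sketch below uses this characterization; the function name `lr_keyroots` and the 1-based list encoding of $l$ are assumptions for the example.

```python
def lr_keyroots(l):
    """Return the sorted LR key roots, given the 1-based leftmost-leaf list l.

    For each leftmost leaf, the highest-numbered node having that leftmost
    leaf is kept; these nodes are the root and the nodes with a left sibling.
    """
    last = {}
    for i in range(1, len(l)):
        last[l[i]] = i        # later (higher) nodes overwrite earlier ones
    return sorted(last.values())


# l for T1 = f(d(a, c(b)), e) with postorder a=1, b=2, c=3, d=4, e=5, f=6.
l_t1 = [None, 1, 2, 2, 1, 5, 1]
keyroots_t1 = lr_keyroots(l_t1)
```

For this tree the key roots are `c` (left sibling `a`), `e` (left sibling `d`) and the root `f`. Since each key root is stored under a distinct leftmost leaf, their number can never exceed the number of leaves.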

Therefore all we need to compute are distances between pairs of subtrees rooted
at LR key roots. We formulate the pseudocode as follows (see Algorithm 1).

To compute each $treedist(i,j)$, the $forestdist$ values computed and used here are
put in a temporary array that is freed once the corresponding $treedist$ is computed.
The $treedist$ values are put in the permanent $treedist$ array. The computation of
$treedist(i,j)$ is again bottom-up: we start from the smallest prefixes of $tree(i)$ and $tree(j)$
and build up. The details are given in [47], p. 1253.

Algorithm 1 Dynamic Programming Algorithm for Tree Edit Distance
Input: Trees $T_1$ and $T_2$
Output: $treedist(i,j)$, where $1 \le i \le |T_1|$ and $1 \le j \le |T_2|$
Preprocessing: Compute the $l$ function and the LR key roots for $T_1$ and $T_2$, and put
    them in the arrays $KR_1$ and $KR_2$, respectively
for ($i' = 1 \to KR_1.size()$) do
    for ($j' = 1 \to KR_2.size()$) do
        $i = KR_1[i']$;    ▷ (pick the $i'$-th key root)
        $j = KR_2[j']$;    ▷ (pick the $j'$-th key root)
        Compute $treedist(i,j)$ using the $forestdist$ values of the prefixes of $tree(i)$ and $tree(j)$;
    end for
end for
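To make the interplay between Corollary 2.2.1 and the key roots concrete, here is a compact Python sketch in the spirit of Algorithm 1. It is an illustrative reading of the recurrences above rather than a transcription of the procedure in [47]; the nested `(label, [children])` input format, the use of `None` for $\Lambda$, the unit cost, and all function names are assumptions.

```python
def annotate(tree):
    """Postorder labels, leftmost-leaf list l (both 1-based) and sorted key roots."""
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leftmost = visit(child)
            if first is None:
                first = leftmost
        labels.append(label)
        idx = len(labels) - 1
        l.append(idx if first is None else first)
        return l[idx]

    visit(tree)
    last = {}
    for i in range(1, len(l)):
        last[l[i]] = i
    return labels, l, sorted(last.values())


def unit_cost(x, y):
    return 0 if x == y else 1


def tree_edit_distance(t1, t2, gamma=unit_cost):
    A, l1, kr1 = annotate(t1)
    B, l2, kr2 = annotate(t2)
    n1, n2 = len(A) - 1, len(B) - 1
    treedist = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # permanent array
    for i1 in kr1:            # key roots in increasing postorder
        for j1 in kr2:
            oi, oj = l1[i1] - 1, l2[j1] - 1
            rows, cols = i1 - oi, j1 - oj
            # fd[x][y] = forestdist(l(i1)..oi+x, l(j1)..oj+y), temporary array
            fd = [[0] * (cols + 1) for _ in range(rows + 1)]
            for x in range(1, rows + 1):
                fd[x][0] = fd[x - 1][0] + gamma(A[oi + x], None)
            for y in range(1, cols + 1):
                fd[0][y] = fd[0][y - 1] + gamma(None, B[oj + y])
            for x in range(1, rows + 1):
                i = oi + x
                for y in range(1, cols + 1):
                    j = oj + y
                    dele = fd[x - 1][y] + gamma(A[i], None)
                    ins = fd[x][y - 1] + gamma(None, B[j])
                    if l1[i] == l1[i1] and l2[j] == l2[j1]:
                        # Part 1 of Corollary 2.2.1: both forests are trees.
                        fd[x][y] = min(dele, ins,
                                       fd[x - 1][y - 1] + gamma(A[i], B[j]))
                        treedist[i][j] = fd[x][y]
                    else:
                        # Part 2: reuse a previously stored subtree distance.
                        fd[x][y] = min(dele, ins,
                                       fd[l1[i] - 1 - oi][l2[j] - 1 - oj]
                                       + treedist[i][j])
    return treedist[n1][n2]


t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
t2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
d12 = tree_edit_distance(t1, t2)
```

Processing the key roots in increasing postorder guarantees that every `treedist[i][j]` read in the second branch was already written during an earlier key root pair. For the two example trees, one deletion and one insertion of the node `c` suffice, and no single operation can work because the trees have the same size but different shapes.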

2.4 Zhang and Shasha’s Algorithm Part III: Algorithm Complexity Analysis

We first bound the size of the set of key roots in a tree.

Lemma 2.4.1 ([47]). The number of LR key roots of a tree $T$ is less than or equal to the number of leaves of $T$.

Proof. We show that distinct key roots $i$ and $j$ have distinct leftmost leaf descendants $l(i)$ and $l(j)$, respectively, thereby proving the claim. Suppose not, and without loss of generality assume that $i < j$. Then $i$ is on the path from $l(j)$ to $j$. By the definition of $l(j)$, $i$ does not have any left siblings, contradicting the assumption that $i$ is a key root. Therefore $l(i) \neq l(j)$. □

The complexity of the above algorithm is determined by the number of pairs of subtrees whose distances are computed. For any node $i$ in $T$, we say that it participates in the computation if it belongs to such a subtree, rooted at a key root. Then it is easy to see that the number of times any given node participates equals the number of its key root ancestors. We define this quantity to be the collapsed depth of the node.

Definition 2.4.1 ([47]). The collapsed depth of a node $i$, denoted by $cdepth(i)$, is the number of key root ancestors of $i$, including $i$ itself if it is a key root. We set $cdepth(T) := \max_i cdepth(i)$.

Then by Lemma 2.4.1,

$$cdepth(T) \le \min\{depth(T), |leaves(T)|\}. \tag{2.4.1}$$

Now we bound the total number of participating nodes:

Lemma 2.4.2 ([47]). Let $K$ be the number of LR key roots of $T$, and $N$ be the number of nodes of $T$. Then

$$\sum_{i=1}^{K} Size(i) = \sum_{j=1}^{N} cdepth(j). \tag{2.4.2}$$

Proof. Note that the left hand side of (2.4.2) is exactly the total number of participating nodes, counted with multiplicity, in the computation of the tree edit distance with $T$ as one of the trees. Each participating node $i$ is counted $cdepth(i)$ times in the left summation. Moreover, each node $j$ such that $cdepth(j) > 0$ is participating. Therefore the two summations agree. □
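The identity (2.4.2) can be checked on a small tree. The sketch below assumes that $Size(i)$ is the number of nodes in the subtree rooted at the $i$-th key root (which is how it is used in the running time analysis) and uses a hypothetical nested-pair encoding `(label, [children])`; the function name is ours.

```python
def key_root_sizes_and_cdepths(tree):
    """Return (sizes of subtrees rooted at LR key roots, cdepth of every node).

    `tree` is a hypothetical nested pair (label, [children]). A node is an
    LR key root if it is the root or has a left sibling.
    """
    key_sizes, cdepths = [], []

    def visit(node, is_key_root, key_ancestors):
        _, children = node
        # cdepth counts key root ancestors, including the node itself.
        depth = key_ancestors + (1 if is_key_root else 0)
        cdepths.append(depth)
        size = 1
        for position, child in enumerate(children):
            size += visit(child, position > 0, depth)
        if is_key_root:
            key_sizes.append(size)
        return size

    visit(tree, True, 0)
    return key_sizes, cdepths


tree = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
sizes, cdepths = key_root_sizes_and_cdepths(tree)
print(sum(sizes), sum(cdepths))  # -> 9 9
```

Here the key roots are `f`, `c` and `e` with subtree sizes 6, 2 and 1, while the collapsed depths of `f, d, a, c, b, e` are 1, 1, 1, 2, 2, 2, so both sides of (2.4.2) equal 9.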

Now we are in a position to bound the running time and space usage of Algorithm 1:

Theorem 2.3 ([47]). The above algorithm for computing the edit distance between $T_1$ and $T_2$ takes time

$$O\big(|T_1||T_2| \cdot \min\{depth(T_1), |leaves(T_1)|\} \cdot \min\{depth(T_2), |leaves(T_2)|\}\big), \tag{2.4.3}$$

and space

$$O(|T_1||T_2|). \tag{2.4.4}$$

Proof. For the space complexity, we use arrays to keep the key roots, the $treedist$ values and the $forestdist$ values, each of which takes $O(|T_1||T_2|)$ space.

For the time complexity, the preprocessing takes linear time to compute $l$ and the key roots. In the main loop, we compute $treedist(i, j)$ for each $1 \le i \le K_1$ and $1 \le j \le K_2$, where $K_1$ and $K_2$ are the numbers of LR key roots of $T_1$ and $T_2$ respectively. Computing $treedist(i, j)$ takes time $Size(i) \cdot Size(j)$, since that is the number of pairs of prefixes of $tree(i)$ and $tree(j)$. Therefore the running time after the preprocessing is:

$$\begin{aligned}
\sum_{i=1}^{K_1} \sum_{j=1}^{K_2} Size(i) \cdot Size(j)
&= \Big( \sum_{i=1}^{K_1} Size(i) \Big) \Big( \sum_{j=1}^{K_2} Size(j) \Big) \\
&= \Big( \sum_{i=1}^{N_1} cdepth(i) \Big) \Big( \sum_{j=1}^{N_2} cdepth(j) \Big) \quad \text{(by Lemma 2.4.2)} \\
&\le |T_1||T_2| \cdot cdepth(T_1) \cdot cdepth(T_2) \\
&\le |T_1||T_2| \cdot \min\{depth(T_1), |leaves(T_1)|\} \cdot \min\{depth(T_2), |leaves(T_2)|\},
\end{aligned}$$

where the last inequality follows from Lemma 2.4.1. This concludes the proof of the theorem. □

3 Tree Edit Distance with Gaps


In this chapter, we study edit distance between trees with gaps, in particular gap models and gap cost functions. The classic edit distance can be viewed as gapped edit distance with linear gap costs.

In Section 3.1, we discuss motivations for introducing gaps in comparing tree similarities. In Section 3.2, we compute the general gap edit distance with affine gap costs between two binary trees of size $m$ and $n$ respectively in $O(m^3n^2 + m^2n^3)$ time. The computation of this distance between general trees is shown to be NP-hard (see [43]).

In Section 3.3, we study the complete subtree gap model, which is a weaker model first proposed by Touzet [43]. We present an algorithm that computes the corresponding edit distance with affine gap costs in $O(m^2n^2)$ time. In Section 3.4, we discuss an application of the complete subtree gap model to contour tree comparisons. Finally, in Section 3.5, some further improvements are discussed.

We assume that all trees considered in this chapter are ordered and labeled with characters taken from a finite alphabet $\Sigma$. We use $\perp$ to denote a special character outside $\Sigma$.

3.1 Motivations and Main Results


Recall that one main motivation for studying tree similarity comparison is that many (complicated) geometric shapes have (simpler) underlying tree structures that capture some key topological or geometric properties of the original shapes. Thus, the problem of shape comparison can be reduced to tree comparison. However, in many applications, such geometric shapes have noise present in their input, which is often reflected in the underlying tree structures. Therefore, it is desirable to delete such “auxiliary” portions of the trees, which do not represent the true topology or geometry of the original shapes, before comparing them.

The classic tree edit distance allows pointwise deletion, i.e. one node at a time. 
There are two natural generalizations. First, in addition to pointwise insertion or 
deletion, multiple nodes could be inserted or deleted. Second, instead of charging 
every deleted node equally, more general cost functions could be used. 

In this more general version of tree edit distance, nodes can be inserted or deleted in groups, called gaps, which are analogous to gaps in sequence alignment (see Definition 2.1.1). Moreover, we can consider affine (or more generally convex) functions for gap costs. This is again motivated by the fact that in some applications, it is more probable to have one “big” piece of noise than several “small” pieces of noise scattered in the input.

What is a good model (e.g. intrinsic and computable) for gaps in trees? One natural definition of a gap is a connected component of the deleted nodes. This is analogous to the sequence alignment case, in which gaps are maximal runs of consecutive deleted positions. This gap model, referred to as the general gap model in this thesis, was first proposed by Touzet [43], whose motivation at the time came from the problem of comparing secondary structures of RNA.

Unfortunately, Touzet [43] showed that, even with affine gap cost functions, the computation of this general gap edit distance is NP-hard*. For this reason, gapped edit distance with nonlinear gap cost functions has received less study than the classic edit distance, which can be viewed as gapped edit distance with linear gap costs (regardless of the choice of gap model). Rolf Backofen et al. [4] studied the application of edit distance with gaps in RNA comparison. S. Schirmer and R. Giegerich [32] studied tree alignment with affine gaps, which concerns the problem of optimally embedding two trees into a common tree, first proposed by T. Jiang, L. Wang and K. Zhang in [25]. G. Blin and H. Touzet [6] studied the application of tree alignment in computational biology.

Here is the central question we consider in this thesis: Even though computing 
the general gap tree edit distance is NP-hard, is it possible to weaken this distance 
and get a computable measure of similarity between two trees? In the following, 
two ways to weaken the general gap edit distance are considered. We could either 
compare more specific trees (e.g. binary trees), or use a more restrictive gap model 
(e.g. complete subtree model). It turns out that both of these two approaches yield 
polynomially computable distances. 

3.2 General Gap Tree Edit Distance Between Binary Trees 

3.2.1 General Gap Model, Edit Distance and Mapping

Definition 3.2.1 (General Gap Model, [43]). Let $T$ be an ordered labeled tree with vertex set $V$ and edge set $E$. A gap $g$ is a tree whose vertex set is a subset of $V$ and whose edges are the edges in $E$ with both endpoints in that subset. A node in $g$ is called a gap node. Topologically, a gap is a subtree of $T$ (see Figure 3.1).

Figure 3.1: The red nodes in this binary tree form a gap.

* More precisely, the following decision problem is NP-hard: given two ordered labeled trees and a positive integer $k$, decide whether the general gap edit distance is bounded from above by $k$.

The corresponding edit operations are as follows.

Definition 3.2.2 (Edit Operations). Here are the three types of edit operations in the general gap model:

1. Relabel a node;

2. Delete a gap, in which case the descendants of the gap become children of the parent of the root of the gap;

3. Insert a gap.

Note that inserting (resp. deleting) a gap in one tree corresponds to deleting 
(resp. inserting) a gap in the other. 

Each edit operation has a nonnegative cost. Let $T_1$, $T_2$ be two trees with vertex sets $V_1$ and $V_2$, respectively. For the cost of relabeling, choose a metric (symmetric, positive definite and satisfying the triangle inequality)

$$p : V_1 \times V_2 \longrightarrow \mathbb{R}_{\ge 0}. \tag{3.2.1}$$

Then $p(u, v)$ defines the cost of changing the label on $u \in V_1$ to that on $v \in V_2$. For the cost of deletion and insertion of gaps, first consider an arbitrary function

$$w : \mathbb{Z}^{+} \longrightarrow \mathbb{R}_{\ge 0} \tag{3.2.2}$$

such that the cost of deleting or inserting a gap $g$ is given by $w(|g|)$, where $|g|$ denotes the number of nodes in $g$. Thus, the cost of $g$ depends only on the size of $g$. One could generalize this even further by considering functions depending on other properties of $g$ (e.g. height, total degree, etc.).

Based on the heuristic that a large gap is more likely to occur than multiple isolated small gaps, convex gap cost functions, i.e. functions satisfying

$$w(k_1 + k_2) \le w(k_1) + w(k_2), \quad \forall k_1, k_2 \in \mathbb{Z}^{+}, \tag{3.2.3}$$

are more suitable as gap penalties for our considerations. In particular, an affine function

$$w(k) = \begin{cases} 0 & \text{for } k = 0, \\ a + bk & \text{for } k \ge 1, \end{cases} \qquad a \ge 0,\ b > 0, \tag{3.2.4}$$


is a convex function (see Lemma 2.1.1). In general, the more complicated the gap 
cost function is, the more difficult the computations will be. In the following, we 
assume that all gap costs are affine unless otherwise specified. 
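To make (3.2.3) and (3.2.4) concrete, the following sketch evaluates an affine gap cost and checks the inequality (3.2.3) on a finite range. The parameter values `a = 2`, `b = 1` and the function names are illustrative assumptions, not values used in this thesis.

```python
def affine_gap_cost(k, a=2.0, b=1.0):
    """Affine gap cost (3.2.4): 0 for an empty gap, a + b*k for k >= 1 nodes."""
    if k < 0:
        raise ValueError("gap size must be nonnegative")
    return 0.0 if k == 0 else a + b * k


def satisfies_3_2_3(w, limit=50):
    """Check w(k1 + k2) <= w(k1) + w(k2) for all 1 <= k1, k2 <= limit."""
    return all(
        w(k1 + k2) <= w(k1) + w(k2)
        for k1 in range(1, limit + 1)
        for k2 in range(1, limit + 1)
    )


# One gap of 5 nodes is cheaper than two gaps of 2 and 3 nodes,
# because the opening cost a is paid only once.
print(affine_gap_cost(5))                       # -> 7.0
print(affine_gap_cost(2) + affine_gap_cost(3))  # -> 9.0
print(satisfies_3_2_3(affine_gap_cost))         # -> True
```

The difference between the two costs is exactly the opening penalty `a`, which is why an affine cost favors one large gap over several small ones.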

Definition 3.2.3 (Edit Script). Given two trees $T_1$ and $T_2$, an edit script $S$ from $T_1$ to $T_2$ is a sequence of edit operations $S = \{S_1, S_2, \ldots, S_n\}$ that transforms $T_1$ to $T_2$. The cost of $S$ is defined to be

$$C(S) := \sum_{i=1}^{n} C(S_i), \tag{3.2.5}$$

where $C(S_i)$ is the cost of the $i$th edit operation.

Definition 3.2.4 (General Gap Tree Edit Distance). Given two ordered labeled trees $T_1$ and $T_2$, the general gap edit distance between $T_1$ and $T_2$ is defined to be

$$\gamma(T_1, T_2) := \min\{C(S) \mid S \text{ is an edit script taking } T_1 \text{ to } T_2\}. \tag{3.2.6}$$

A mapping between two trees, which is a graphical representation of an edit script, can be defined in exactly the same fashion as for the classic edit distance (2.2.2):



Definition 3.2.5 (Mapping Between Two Trees). Given two trees $T_1$ and $T_2$, a mapping between $T_1$ and $T_2$ is a triple $(M, T_1, T_2)$, where $M$ is a subset of $V_1 \times V_2$ such that for any $(u, v)$ and $(u', v')$ in $M$:

(1) $u = u'$ if and only if $v = v'$, called the one-to-one condition.

(2) $u$ is to the left of $u'$ if and only if $v$ is to the left of $v'$, called the sibling order condition.

(3) $u$ is an ancestor of $u'$ if and only if $v$ is an ancestor of $v'$, called the ancestor order condition.

Given a mapping $M$, define its cost to be:

$$C(M) := \sum_{(u,v) \in M} p(u, v) + \sum_{g \in G} \big(a + b|g|\big), \quad u \in V_1,\ v \in V_2, \tag{3.2.7}$$

where $G$ is the set of all gaps in $M$.
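The cost (3.2.7) can be evaluated directly once the gaps are identified as connected components of unmatched nodes. The sketch below assumes trees are given as parent arrays (`parent[v]` is the parent index, `None` for the root) and that the mapping already satisfies the three conditions of Definition 3.2.5, which it does not verify; all names are hypothetical.

```python
def count_gaps(parent, matched):
    """Return (number of gaps, number of gap nodes) among unmatched nodes.

    A gap is a connected component of unmatched nodes, so it is opened by
    exactly one unmatched node whose parent is matched or missing.
    """
    gap_nodes = [v for v in range(len(parent)) if v not in matched]
    gaps = sum(
        1 for v in gap_nodes if parent[v] is None or parent[v] in matched
    )
    return gaps, len(gap_nodes)


def mapping_cost(parent1, parent2, mapping, relabel_cost, a, b):
    """Cost (3.2.7): relabeling costs plus a + b*|g| for every gap g."""
    cost = sum(relabel_cost(u, v) for u, v in mapping)
    matched1 = {u for u, _ in mapping}
    matched2 = {v for _, v in mapping}
    for parent, matched in ((parent1, matched1), (parent2, matched2)):
        gaps, gap_nodes = count_gaps(parent, matched)
        cost += a * gaps + b * gap_nodes
    return cost


# T1: a(b(c), d) with nodes 0..3; T2: a(d) with nodes 0..1.
labels1, parent1 = "abcd", [None, 0, 1, 0]
labels2, parent2 = "ad", [None, 0]
mapping = {(0, 0), (3, 1)}
relabel = lambda u, v: 0 if labels1[u] == labels2[v] else 1
print(mapping_cost(parent1, parent2, mapping, relabel, a=2, b=1))  # -> 4
```

In the example the unmatched nodes `b` and `c` form a single gap of size 2, so the cost is $a + 2b = 4$ rather than $2(a + b) = 6$.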

By Lemma 2.2.1, we still have

$$\gamma(T_1, T_2) = \gamma(M) := \min\{C(M) \mid M \text{ is a mapping from } T_1 \text{ to } T_2\}.$$

Therefore computing the edit distance is equivalent to computing a minimal cost mapping.

3.2.2 Binary Tree Case

Since the computation of the general gap edit distance is NP-hard for arbitrary trees [43], we compute this distance for binary trees. We prove the following main theorem of this thesis:

Theorem 3.1 (Main Theorem). Given two ordered labeled binary trees $T_1$ and $T_2$ with vertex sets $V_1 := V(T_1)$ and $V_2 := V(T_2)$ respectively, and an affine gap cost function. Let $m := |V_1|$ and $n := |V_2|$. The general gap edit distance between $T_1$ and $T_2$ can be computed in $O(m^3n^2 + m^2n^3)$ time. If $m \approx n$, then the running time is $O(m^5)$.

We use a dynamic programming approach to prove this theorem, similar to Zhang and Shasha’s approach [47] in the classic edit distance case. Consider a mapping $M$ and a pair of nodes $(u, v) \in V_1 \times V_2$. There are three possibilities: either $u$ is matched to $v$, or $u$ is a gap node, or $v$ is a gap node. Since the gap cost function is affine, the penalty for starting a gap is different from that for continuing a gap. Moreover, a gap node $u$ is continuing a gap if and only if its parent node, denoted by $p(u)$, is a gap node. Thus, to determine whether a gap node is starting or continuing a gap, we need the information about its parent node. This suggests that we order the nodes according to preorder traversal, as opposed to the postorder traversal used for the classic edit distance.

Order all the nodes in $T_1$ and $T_2$ via preorder traversal and enumerate the nodes in $T_1$ as $1, 2, \ldots, m$, and the nodes in $T_2$ as $1, 2, \ldots, n$. Identify $T_1[i]$, the $i$th node, with its index $i$, $i = 1, \ldots, m$, and similarly for $T_2[j]$, $j = 1, \ldots, n$. Let $T_1[i'..i]$ and $T_2[j'..j]$ be the subforests defined in Section 2.2, and let $forestdist(T_1[i'..i], T_2[j'..j])$ be the edit distance between $T_1[i'..i]$ and $T_2[j'..j]$. Define three auxiliary functions:
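The preorder numbering, the parent index $p(i)$ and the index of the rightmost leaf descendant (called $r(i)$ later in this section) can all be obtained in one traversal. The sketch below is only an illustration: the nested-pair encoding `(label, [children])` and the function name are assumptions, and index 0 is left unused so that indices are 1-based as in the text, with the parent of the root set to 0.

```python
def preorder_index(tree):
    """Return (labels, parent, r) with 1-based preorder indices.

    `tree` is a hypothetical nested pair (label, [children]).
    parent[i] is p(i), with p(root) = 0; r[i] is the index of the
    rightmost leaf descendant of i, so tree(i) occupies indices i..r[i].
    """
    labels, parent, r = [None], [None], [None]

    def visit(node, par):
        label, children = node
        i = len(labels)
        labels.append(label)
        parent.append(par)
        r.append(i)                 # a leaf is its own rightmost leaf
        for child in children:
            r[i] = visit(child, i)  # the last child determines r(i)
        return r[i]

    visit(tree, 0)
    return labels, parent, r


tree = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, parent, r = preorder_index(tree)
print(labels[1:])  # -> ['f', 'd', 'a', 'c', 'b', 'e']
print(parent[1:])  # -> [0, 1, 2, 2, 4, 1]
print(r[1:])       # -> [6, 5, 3, 5, 5, 6]
```

Note that in this numbering the first child of a node $i$ always has index $i + 1$, and the root satisfies $r(1) = m$.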

Definition 3.2.6. For $1 \le i' \le i \le m$ and $1 \le j' \le j \le n$, set:

$$\begin{cases}
Q[i'..i, j'..j] := forestdist(T_1[i'..i], T_2[j'..j]); \\
Q_{\perp *}[i'..i, j'..j] := forestdist(T_1[i'..i], T_2[j'..j]) \text{ such that } i \text{ is a gap point}; \\
Q_{* \perp}[i'..i, j'..j] := forestdist(T_1[i'..i], T_2[j'..j]) \text{ such that } j \text{ is a gap point}.
\end{cases} \tag{3.2.8}$$

With this definition,

$$\gamma(T_1, T_2) = Q[1..m, 1..n].$$

We define the boundary conditions of the auxiliary functions as follows. Note first that $T[i..i'] = T[0] := \emptyset$ if $i' < i$. Set

$$\begin{aligned}
Q[0, 0] &= 0, \\
Q[1..i, 0] &= \infty, && (\text{for } 1 \le i \le m) \\
Q[0, 1..j] &= \infty. && (\text{for } 1 \le j \le n)
\end{aligned}$$

Moreover, set

$$\begin{aligned}
Q_{\perp *}[1..i, 0] &= a + bi, && (\text{for } 1 \le i \le m) \\
Q_{\perp *}[0, 1..j] &= \infty, && (\text{for } 1 \le j \le n)
\end{aligned}$$

since it is impossible to match an empty tree with $T_2[1..j]$ such that the former ends with a gap node, and there is a unique matching between $T_1[1..i]$ and an empty tree: we have $i$ gap points.

By symmetry, set

$$\begin{aligned}
Q_{* \perp}[1..i, 0] &= \infty, && (\text{for } 1 \le i \le m) \\
Q_{* \perp}[0, 1..j] &= a + bj. && (\text{for } 1 \le j \le n)
\end{aligned}$$


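Theorem 3.2 (Recurrence of Auxiliary Matrices in General Gap Model for Binary Trees). Given the preorder ordering on the nodes of two ordered labeled trees $T_1$ and $T_2$. Fix nodes $i_1 \in V_1$, $j_1 \in V_2$. For any $i \in desc(i_1)$ and $j \in desc(j_1)$, we have the following recurrence relations:

$$Q[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i-1, j_1..j-1] + p(i, j) \\
Q_{\perp *}[i_1..i, j_1..j] \\
Q_{* \perp}[i_1..i, j_1..j]
\end{cases} \tag{3.2.9}$$

$$Q_{\perp *}[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i-1, j_1..j] + (a + b) \\
Q_{\perp *}[i_1..i-1, j_1..j] + b \\
\min\limits_{j_1 \le k \le j} \{ Q_{\perp *}[i_1..p(i), j_1..k] + Q[p(i)+1..i-1, k+1..j] + b \}
\end{cases} \tag{3.2.10}$$

$$Q_{* \perp}[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i, j_1..j-1] + (a + b) \\
Q_{* \perp}[i_1..i, j_1..j-1] + b \\
\min\limits_{i_1 \le k \le i} \{ Q_{* \perp}[i_1..k, j_1..p(j)] + Q[k+1..i, p(j)+1..j-1] + b \}
\end{cases} \tag{3.2.11}$$

Here $p(i)$ (resp. $p(j)$) is the index of the parent node (if it exists) of $i$ (resp. $j$).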


Proof. We prove recurrences (3.2.9) and (3.2.10). Recurrence (3.2.11) can be obtained by a symmetric argument. We first assume that both $i_1$ and $j_1$ have nontrivial siblings.

• In the first recurrence, for $Q[i_1..i, j_1..j]$, there are three cases:

(1) Neither $i$ nor $j$ is a gap point. Then $i$ must be matched to $j$, and

$$Q[i_1..i, j_1..j] = Q[i_1..i-1, j_1..j-1] + p(i, j),$$

where $p(i, j)$ is the cost of matching $i$ with $j$.

(2) $i$ is a gap point. Then

$$Q[i_1..i, j_1..j] = Q_{\perp *}[i_1..i, j_1..j].$$

(3) Similarly, if $j$ is a gap point, then

$$Q[i_1..i, j_1..j] = Q_{* \perp}[i_1..i, j_1..j].$$

The above cases exhaust all the possibilities, which proves (3.2.9).

• Next we prove the second recurrence, in which $i$ is a gap point. Let $p(i)$ be the parent of $i$. If $i$ is the root then $p(i) := 0$. For the moment we assume that $i$ has a nontrivial sibling. There are three cases:

(1) If $p(i) = 0$, or $p(i)$ exists and is not a gap node, then $i$ is starting a new gap and hence gets penalized with $a + b$:

$$Q_{\perp *}[i_1..i, j_1..j] = Q[i_1..i-1, j_1..j] + (a + b).$$

(2) If $p(i)$ is a gap node and $i$ is its left child (see Figure 3.2), then $i$ is continuing a preexisting gap, hence only gets penalized with $b$. In the preorder ordering, $p(i) = i - 1$, therefore

$$Q_{\perp *}[i_1..i, j_1..j] = Q_{\perp *}[i_1..i-1, j_1..j] + b.$$

Figure 3.2: $p(i)$ is a gap node and $i$ is its left child. Gap nodes are labeled black.

(3) If $p(i)$ is a gap node and $i$ is its right child (see Figure 3.3), then $i$ is continuing a preexisting gap, hence gets penalized with $b$ as well. The left child of $p(i)$ has index $p(i) + 1$. The subforest $T_1[i_1..p(i)]$ is matched to a subforest $T_2[j_1..k]$ with $p(i)$ being a gap node, for some $k \in [j_1, j] \cap \mathbb{Z}$; hence the subtree rooted at $p(i) + 1$ will be matched with the remaining part of $T_2[j_1..j]$, which is $T_2[k+1..j]$. Therefore, for some $j_1 \le k \le j$,

$$Q_{\perp *}[i_1..i, j_1..j] = Q_{\perp *}[i_1..p(i), j_1..k] + Q[p(i)+1..i-1, k+1..j] + b.$$

Figure 3.3: $p(i)$ is a gap node and $i$ is its right child. Gap nodes are labeled black.

• To complete the proof, it is left to prove (3.2.10) in the case where $i$ is the only child of $p(i)$, since it is immediate that (3.2.9) still holds in this case. In this case, $p(i) = i - 1$ in the preorder ordering. There are two cases:

(1) $p(i)$ is not a gap point, and hence $i$ is starting a new gap:

$$Q_{\perp *}[i_1..i, j_1..j] = Q[i_1..i-1, j_1..j] + (a + b).$$

(2) $p(i)$ is a gap point, and hence $i$ is continuing a preexisting gap:

$$Q_{\perp *}[i_1..i, j_1..j] = Q_{\perp *}[i_1..i-1, j_1..j] + b.$$

Notice that $T_1[p(i)+1..i-1] = T_1[i..i-1] = \emptyset$. Consequently, $Q[p(i)+1..i-1, k+1..j] = \infty$ by the boundary conditions for $Q$ (for $1 \le j \le n$). Therefore (3.2.10) still holds in this case.

Combining with the above, we have proved recurrence (3.2.10). □

Algorithm for computing $Q[1..m, 1..n]$: Let $treedist(i, j) := Q[i..r(i), j..r(j)]$, where $r(i)$ is the index of the rightmost leaf descendant of the subtree rooted at $i$. In particular, if $i$ is a leaf, then $r(i) = i$. Since the roots have index 1 in the preorder numbering, with $r(1) = m$ in $T_1$ and $r(1) = n$ in $T_2$, the edit distance between $T_1$ and $T_2$ in the general gap model is $treedist(1, 1) = Q[1..m, 1..n]$, and can be computed by Algorithm 2 below.

Algorithm 2 General Gap Tree Edit Distance
Input: Trees $T_1$ and $T_2$
Output: $treedist(i_1, j_1)$, where $1 \le i_1 \le m := |T_1|$ and $1 \le j_1 \le n := |T_2|$
Preprocessing: Compute the index of the parent of each node and the $r$ function
for each pair $(i_1, j_1)$ with $1 \le i_1 \le m$ and $1 \le j_1 \le n$ do
    for ($i = i_1$; $i \le r(i_1)$; $i$++) do
        for ($j = j_1$; $j \le r(j_1)$; $j$++) do
            Compute $treedist(i_1, j_1)$ by first computing $Q_{\perp *}[i_1..i, j_1..j]$ and $Q_{* \perp}[i_1..i, j_1..j]$,
            then computing $Q[i_1..i, j_1..j]$
        end for
    end for
end for

Running Time Analysis: A crude upper bound for the time to compute the above recurrences can be obtained as follows. We can preprocess $r(i)$ and $r(j)$ for each $i \in V_1$, $j \in V_2$. Each computation can be done in linear time. The total time needed to compute $Q[i_1..i, j_1..j]$, $Q_{\perp *}[i_1..i, j_1..j]$ and $Q_{* \perp}[i_1..i, j_1..j]$ is upper bounded by

$$3 + 2 + m + 2 + n = 7 + m + n.$$

Since there are $O(m^2n^2)$ pairs of subforests $(T_1[i_1..i], T_2[j_1..j])$, the upper bound for computing all of these recurrences is:

$$(7 + (m + n))\, m^2n^2 = O(m^3n^2 + m^2n^3).$$

Hence Theorem 3.1 is proved.
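As a quick check of the last claim in Theorem 3.1, suppose $m \approx n$ is read as $n = \Theta(m)$ (this reading is an assumption, since the text does not define $\approx$). Expanding the bound term by term then gives:

```latex
(7 + m + n)\, m^2 n^2 \;=\; 7 m^2 n^2 + m^3 n^2 + m^2 n^3 \;=\; O(m^3 n^2 + m^2 n^3),
\qquad
n = \Theta(m) \;\Longrightarrow\; m^3 n^2 + m^2 n^3 = \Theta(m^5).
```

The term $7m^2n^2$ is dominated by the other two because $m, n \ge 1$.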


3.3 Complete Subtree Gap Tree Edit Distance 

In this section, we study a weaker model of gaps, first proposed by Touzet: 

Definition 3.3.1 (Complete Subtree Gap Model, [43]). Given a tree $T$ with vertex set $V$. A gap of $T$ is the complete subtree rooted at some vertex $v \in V$.

Every gap in the complete subtree model is a gap in the general model, but not 
vice versa (see Figure 3.4). 
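The difference between the two gap models can be stated as two small predicates on a set of nodes. The sketch below assumes the tree is given as a parent array (`parent[v]` is the parent index, `None` for the root); the function names are hypothetical.

```python
def is_general_gap(parent, nodes):
    """General model: the nodes form one connected subtree.

    In a rooted tree a nonempty node set is connected exactly when it has
    a single topmost node, i.e. one node whose parent is outside the set.
    """
    nodes = set(nodes)
    tops = [v for v in nodes if parent[v] is None or parent[v] not in nodes]
    return len(tops) == 1


def is_complete_subtree_gap(parent, nodes):
    """Complete subtree model: connected and closed under taking children."""
    nodes = set(nodes)
    if not is_general_gap(parent, nodes):
        return False
    return all(
        v in nodes
        for v in range(len(parent))
        if parent[v] is not None and parent[v] in nodes
    )


# Binary tree: 0 has children 1 and 2; 1 has children 3 and 4.
parent = [None, 0, 0, 1, 1]
print(is_general_gap(parent, {1, 3, 4}), is_complete_subtree_gap(parent, {1, 3, 4}))  # -> True True
print(is_general_gap(parent, {0, 1}), is_complete_subtree_gap(parent, {0, 1}))        # -> True False
print(is_general_gap(parent, {3, 4}))                                                 # -> False
```

The set `{0, 1}` is connected but omits descendants of its nodes, so it is a gap only in the general model, mirroring the situation on the right of Figure 3.4.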

Touzet [43] computed the complete subtree gap edit distance using a product tree data structure. We present a different algorithm that is motivated by sequence alignment with gaps (Section 2.1) and the classic edit distance algorithm of Zhang and Shasha [47]. In particular, we prove that:

Theorem 3.3 (Complete Subtree Gap Tree Edit Distance). Given two ordered labeled binary trees $T_1$ and $T_2$ with vertex sets $V_1 := V(T_1)$ and $V_2 := V(T_2)$ respectively, and an affine gap cost function. Let $m := |V_1|$ and $n := |V_2|$. The complete subtree gap edit distance between $T_1$ and $T_2$ can be computed in $O(m^2n^2)$ time. If $m \approx n$, then the running time is $O(m^4)$.

Figure 3.4: The subtree with red nodes on the left is a gap in the complete subtree model, since it is a complete subtree. The subtree with red nodes on the right is not a gap in the complete subtree model. However, it is a gap in the general model.

Let $m := |T_1|$ and $n := |T_2|$. Order the nodes in $T_1$ and $T_2$ via preorder traversal for the same reason as in the general gap case: a gap node is continuing a gap if and only if its parent is a gap node. Enumerate the nodes in $T_1$ as $1, 2, \ldots, m$, and the nodes in $T_2$ as $1, 2, \ldots, n$, and identify each node (together with its label) with its index in this preorder ordering. Let $T_1[i'..i]$, $T_2[j'..j]$, $forestdist(T_1[i'..i], T_2[j'..j])$, $Q$, $Q_{\perp *}$ and $Q_{* \perp}$ be the same as in the general gap case. Our goal is again to compute:

$$\gamma(T_1, T_2) = \gamma(M) = Q[1..m, 1..n].$$

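Theorem 3.4 (Recurrence of Auxiliary Matrices in Complete Subtree Gap Model). Given the preorder ordering on the nodes of two ordered labeled trees $T_1$ and $T_2$. Fix nodes $i_1 \in V_1$, $j_1 \in V_2$. For any $i \in desc(i_1)$ and $j \in desc(j_1)$, we have the following recurrence relations:

$$Q[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i-1, j_1..j-1] + p(i, j) \\
Q_{\perp *}[i_1..i, j_1..j] \\
Q_{* \perp}[i_1..i, j_1..j]
\end{cases} \tag{3.3.1}$$

$$Q_{\perp *}[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i-1, j_1..j] + (a + b) \\
Q_{\perp *}[i_1..p(i), j_1..j] + b(i - p(i))
\end{cases} \tag{3.3.2}$$

$$Q_{* \perp}[i_1..i, j_1..j] = \min \begin{cases}
Q[i_1..i, j_1..j-1] + (a + b) \\
Q_{* \perp}[i_1..i, j_1..p(j)] + b(j - p(j))
\end{cases} \tag{3.3.3}$$

Here $p(i)$ (resp. $p(j)$) is the index of the parent node (if it exists) of $i$ (resp. $j$).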


Proof. We prove recurrences (3.3.1) and (3.3.2). Recurrence (3.3.3) can be obtained by a symmetric argument.

• In the first recurrence, for $Q[i_1..i, j_1..j]$, there are three cases:

(1) Neither $i$ nor $j$ is a gap point. Then $i$ must be matched to $j$, and thus

$$Q[i_1..i, j_1..j] = Q[i_1..i-1, j_1..j-1] + p(i, j),$$

where $p(i, j)$ is the cost of matching $i$ with $j$.

(2) $i$ is a gap point. Then

$$Q[i_1..i, j_1..j] = Q_{\perp *}[i_1..i, j_1..j].$$

(3) Similarly, if $j$ is a gap node, then

$$Q[i_1..i, j_1..j] = Q_{* \perp}[i_1..i, j_1..j].$$

The above cases exhaust all the possibilities, which proves (3.3.1).

• Next we prove the second recurrence, in which $i$ is a gap node. Let $p(i)$ be the parent of $i$. If $i$ is the root then $p(i) := 0$. There are two cases:

(1) If $p(i) = 0$, or $p(i)$ exists and is not a gap node, then $i$ is starting a new gap and hence gets penalized with $a + b$:

$$Q_{\perp *}[i_1..i, j_1..j] = Q[i_1..i-1, j_1..j] + (a + b).$$

(2) If $p(i)$ exists and is a gap node (see Figure 3.5), then by the complete subtree gap model, every descendant of $p(i)$ must be a gap node as well. There are $i - p(i)$ such nodes with index at most $i$. Since they are all continuing a preexisting gap, the total cost is $b(i - p(i))$. In particular, if $i$ is the only child of $p(i)$, then $i - p(i) = 1$: we only penalize the node $i$.

Figure 3.5: $p(i)$ and $i$ are both gap nodes. All the nodes in between them are gap nodes as well.

In this case, we have:

$$Q_{\perp *}[i_1..i, j_1..j] = Q_{\perp *}[i_1..p(i), j_1..j] + b(i - p(i)).$$

□

Algorithm for computing $Q[1..m, 1..n]$: Let $treedist(i, j) := Q[i..r(i), j..r(j)]$. Since the roots have index 1 in the preorder numbering, the edit distance between $T_1$ and $T_2$ in the complete subtree gap model is $treedist(1, 1) = Q[1..m, 1..n]$, and can be computed by Algorithm 3 below.

It is easy to see that the running time of this algorithm is $O(m^2n^2)$, which proves Theorem 3.3.

Algorithm 3 Complete Subtree Gap Tree Edit Distance
Input: Trees $T_1$ and $T_2$
Output: $treedist(i_1, j_1)$, where $1 \le i_1 \le m := |T_1|$ and $1 \le j_1 \le n := |T_2|$
Preprocessing: Compute the index of the parent of each node and the $r$ function
for each pair $(i_1, j_1)$ with $1 \le i_1 \le m$ and $1 \le j_1 \le n$ do
    for ($i = i_1$; $i \le r(i_1)$; $i$++) do
        for ($j = j_1$; $j \le r(j_1)$; $j$++) do
            Compute $treedist(i_1, j_1)$ by first computing $Q_{\perp *}[i_1..i, j_1..j]$ and $Q_{* \perp}[i_1..i, j_1..j]$,
            then computing $Q[i_1..i, j_1..j]$
        end for
    end for
end for



Figure 3.8: Two smooth terrains, shown in panels (a) (Figure 3.6) and (b) (Figure 3.7), that look similar. Both of them are local graphs of combinations of trigonometric functions.

3.4 Application of Complete Subtree Gap Tree Edit Distance to Terrain Comparisons

In this section, we study the problem of comparing the similarities between two terrains. As discussed in the introduction (Section 1.1.2), contour trees are the underlying tree structures of terrains that capture the evolution of the connected components of the level sets, or contours. Thus the problem of comparing two terrains (see Figure 3.8) can be reduced to comparing their corresponding contour trees.

Roughly speaking, a terrain in $\mathbb{R}^3$ is the graph of a height function on $\mathbb{R}^2$. More precisely, let $f : \mathbb{R}^2 \to \mathbb{R}$ be a smooth Morse function with isolated critical points. The graph of $f$, denoted by $\Gamma_f$, is called a smooth terrain.

Define $M_{<h}(f) := \{x \in \mathbb{R}^2 \mid f(x) < h\}$ to be the set of points in the plane with height less than $h$, and $M_h(f) := \partial M_{<h}(f) = \{x \in \mathbb{R}^2 \mid f(x) = h\}$ to be the $h$-level set of $f$. A connected component of $M_h(f)$ is called a contour.

As we vary $h$, the $h$-level set changes topology only at critical points of $f$, which are local maxima, local minima and saddle points. A contour first appears at a local minimum and disappears at a local maximum. At a saddle point, either two contours join and become a single contour, in which case the saddle is called negative, or one contour splits into two, in which case the saddle is called positive. The contour tree of a terrain is defined as follows:

Definition 3.4.1 (Contour Tree). Given a smooth terrain $\Gamma_f$ defined as above. The associated contour tree is a graph $C_f$ whose nodes are the critical points of $f$, and there is an edge $(u, v)$ if a contour appears at $v$ and disappears at $u$. See Figure 3.9.

The contour tree of a terrain was first defined by Boyell and Ruston [8], and is in fact a tree (see [11]). Much research has been done on computing the contour tree of a terrain in $\mathbb{R}^3$ [44, 42, 1] and in higher dimensions [10, 11]. Applications of contour trees have been studied in [23, 28, 38, 41]. However, to the best of our knowledge, there has been no study of the problem of comparing contour trees as a similarity measure of their corresponding terrains.

In the following, we only consider piecewise linear terrains, i.e. graphs of piecewise linear height functions. More precisely, let $\mathbb{M}$ be a triangulation of $\mathbb{R}^2$ and let $V$ be the set of all vertices in $\mathbb{M}$. Consider a height function

$$f : V \longrightarrow \mathbb{R}$$

defined on the vertex set, such that $f$ is one-to-one (i.e. no two vertices have the same height). Extend $f$ to the entire plane in a piecewise linear fashion, and identify $f$ with its extension. Thus $\Gamma_f$ is a piecewise linear terrain. In this case, all contours are closed polygonal curves.

Figure 3.9: Smooth terrain (a) with its contour tree (b). Picture courtesy of P. Agarwal, L. Arge and K. Yi [1].

Given two triangulations $\mathbb{M}_1$ and $\mathbb{M}_2$ of $\mathbb{R}^2$ and height functions $f$ and $g$ defined on the vertices and then extended linearly to $\mathbb{R}^2$, we define the distance between $\Gamma_f$ and $\Gamma_g$ to be the edit distance between the contour trees $C_f$ and $C_g$.

Recall that gaps are introduced in tree edit distance to be able to deal with noise in the input. Now the question is: what should the gap model be in the case of contour trees? It is easy to see that noise in the input terrain (i.e. “wiggles” on the surface) is reflected as complete subtrees in the contour tree.

Here is the upshot: the complete subtree gap edit distance can be used to compute the similarity between two contour trees, which can then be used as a measure of similarity of the original terrains. It is worth noting that this is only a topological measure of similarity, since the contour tree is a topological construct. Two terrains with identical contour trees do not need to share the same geometry (e.g. curvature, area, geodesics, etc.).

There are several natural candidates for the cost of gaps. A topological penalty could be the persistence of the noise in the terrain that corresponds to a gap in the contour tree. Geometric penalties could be the height or the volume of the noise in the terrain that corresponds to a gap. We leave the question of which penalty function is better, as well as the implementations, to a future project.

3.5 Further Improvements 

In both the general and the complete subtree gap models, gaps can have arbitrary sizes (up to the size of the tree). However, in some applications, one usually has an upper bound on the size of the noise in the input, and hence on the size of gaps. A natural generalization is to incorporate this upper bound criterion into these gap models. Given a tree $T$ and an integer $k$ such that $0 < k \le |T|$, define a gap $g$ to be an arbitrary subtree with at most $k$ nodes. When $k = |T|$, this is the general gap model. Consequently, in our recurrences, when a node is continuing a gap, we need an additional check on whether the current gap size has exceeded $k$ before penalizing the gap node. We leave a more rigorous discussion of this improvement to a future project.

4 Conclusion and Future Works

In this thesis, we studied edit distance with gaps between two ordered, labeled trees. Touzet [43] proposed two gap models: the general model and the complete subtree model. Given two trees $T_1$ and $T_2$ with $m$ and $n$ nodes respectively, we computed the general gap edit distance between binary trees in $O(m^3n^2 + m^2n^3)$ time, and the complete subtree gap edit distance between arbitrary trees in $O(m^2n^2)$ time. Our dynamic programming algorithms are motivated by the classic sequence alignment algorithms [36] and Zhang and Shasha’s classic edit distance algorithm [47]. In both models, we assume that the gap cost function is affine. Prior to our work, no explicit algorithm was known for computing the general gap edit distance, since such computation is NP-hard for arbitrary trees (see [43]). We studied an application of the complete subtree gap edit distance to terrain comparison via comparing the similarities between the corresponding contour trees.

The following are some open problems that are suitable for a future project:

Problem 4.1. Recently S. Sankararaman, P. Agarwal and T. Molhave [31] studied the problem of comparing similarity between two trajectories sampled at a certain rate, using sequence alignment, which is a topological construction, and dynamic time warping, which is a geometric construction.

The gapped edit distance is by definition a topological measure of similarity between trees. Is it possible to combine the edit distance with some geometric similarity measures (e.g. dynamic time warping) as in the trajectory alignment case?

Problem 4.2. Our algorithm for computing the general gap edit distance between 
binary trees seems to suggest that the NP-hardness of computing this distance for 
arbitrary trees comes from the fact that the degrees or the branching factors of the 
internal nodes vary. Moreover, the running time should depend on the degree in an 
exponential fashion. A natural next step is to compute this distance for trees with 
fixed degrees (e.g. ternary trees). 

Problem 4.3. Our algorithms are too slow for any practical applications. Is it possible to simplify the algorithm by recognizing repetitions in the recurrences as in Zhang and Shasha’s work [47]?